Abstract - The ranking problem has attracted much attention in real systems. Designing a robust
ranking method is especially important for online rating systems under the threat of spamming
attacks. Many well-performing ranking methods address this issue by building reputation systems
for users. In this Letter, we propose a group-based ranking method that evaluates users’
reputations based on their grouping behaviors. More specifically, users are assigned high
reputation scores if they always fall into large rating groups. Results on three real data sets
indicate that the present method is more accurate and robust than the correlation-based
method in the presence of spamming attacks.


Introduction. — With the rapid development of the 
Internet, billions of services and objects are online for us to
choose [1]. At the same time, the problem of information
overload troubles us every day [2-4]. Therefore, many web
sites (Amazon, eBay, MovieLens, Netflix, etc.) introduce
online rating systems, where users can give discrete ratings
to objects. In turn, the ratings of an object serve as a
reference and later affect other users’ decisions [5,6].
Basically, high ratings promote consumption, while
low ratings play the opposite role [7]. In real cases, some
users may give unreasonable ratings simply because they are
unfamiliar with the related field [8], and others
deliberately give biased ratings for various psychosocial
reasons [9-13]. These widespread distorted ratings can
harm or boost objects’ reputations, mislead others’ judgments,
and affect the evolution of rating systems [14-16].
Because of these negative effects of spamming attacks,
designing a robust method for online rating systems is
becoming an urgent task [17-19].

To solve this problem, a common approach is to build a reputation
system for users [20-28]. Laureti et al. [25]
proposed an iterative refinement (IR) method, where a
user’s reputation is inversely proportional to the difference
between his ratings and the estimated quality of the
corresponding objects (i.e., the weighted average rating). The
reputation and the estimated quality are iteratively
calculated until they become stable. An improved IR method
was proposed in [26], which assigns trust to each individual
rating. Later, Zhou et al. [27] proposed the correlation-based
ranking (CR) method, which is robust to spamming
attacks; there, a user’s reputation is iteratively determined
by the correlation between his ratings and objects’
estimated quality. Very recently, by introducing a reputation
redistribution process and two penalty factors, Liao
et al. [28] further improved the CR method.

In the majority of previous works [29,30], a single standard
for each object’s quality is required in determining users’
reputations, with the underlying assumption that every object
is associated with a most objective rating that best reflects
its quality. However, in real cases, one object may have
multiple valid ratings, since ratings are subjective and
can be affected by users’ backgrounds [31-34]. For the case
of more than one reasonable answer to a single task,
Tian et al. [29] analyzed the group structure of schools
of thought to identify reliable
workers as well as unambiguous tasks in data collection.
Specifically, a worker who is consistent with many other
workers in most of the tasks is reliable, and a task whose
answers form a few tight clusters is easy and clear.
Analogously, in online rating systems, an object’s quality
is clear if its ratings are concentrated, while it is not clear if
Fig. 1: (Color online) Illustrating the group-based method. The number beside each gray arrow marks the step of the procedure.
(a) The original weighted bipartite network, $G$. (b) The corresponding rating matrix, $A$. The rows and columns correspond to
users and objects, respectively. Non-ratings are marked by a placeholder symbol and can be ignored in the calculation. (c) The groups
of users, $\Gamma$, after being grouped according to their ratings. Take $O_2$ as an example (green vertical box). As $U_2$ and $U_4$ rated
4 to $O_2$, they are put into group $\Gamma_{4,2}$. (d) The sizes of groups, $\Lambda$, e.g. $\Lambda_{4,2} = 2$ as $\Gamma_{4,2} = \{U_2, U_4\}$. (e) The rating-rewarding
matrix, $\Lambda^*$, constructed by normalizing $\Lambda$ by column, e.g. $\Lambda^*_{4,2} = 2/(1 + 2 + 2) = 0.40$. (f) The rewarding matrix, $A'$, obtained
by mapping matrix $A$ referring to $\Lambda^*$, e.g. $A'_{4,2} = 0.40$. (g) The ranking list of users based on reputation $R$. Take $U_3$ as an
example (blue horizontal box in (f)), $R_3 = \mu(A'_3)/\sigma(A'_3) = 3.75$. Setting the spam list’s length as $L = 2$, $U_5$ and $U_3$ (red
dashed box) are detected as spammers.


the ratings are widely distributed. Under this framework,
a single estimation of an object’s quality is no longer
applicable [30]. Practically, a random rating to objects with
confusing quality should be acceptable since no single
rating can dominate its true quality, while a biased rating to
objects with clear quality is unreasonable. Users who are
consistent with the majority in most ratings will form big
groups and be trusted, since herding is a well-documented
feature of human behaviour [6, 35]. Users who always give
distorted ratings will form relatively small groups and be
highly suspect, since unreasonable or biased ratings are
discordant [29]. These ideas offer a promising way
to build reputation systems based on users’ grouping
behaviour instead of solving the crucial problem of
estimating objects’ true qualities as before.

In this Letter, we propose a group-based ranking (GR)
method for online rating systems with spamming attacks.
By grouping users according to their ratings, users’
reputations are determined according to the corresponding
group sizes. If users always fall into large groups, their
reputations are high; otherwise, their reputations
are low. Extensive experiments on three real data sets
(MovieLens, Netflix and Amazon) suggest that the
proposed method outperforms the CR method.

Method. — The online rating system is naturally
described by a weighted bipartite network $G = \{U, O, E\}$,
where $U = \{U_1, U_2, \ldots, U_m\}$, $O = \{O_1, O_2, \ldots, O_n\}$ and
$E = \{E_1, E_2, \ldots, E_l\}$ are the sets of users, objects and ratings,
respectively [36]. The degrees of a user $i$ and an object
$\alpha$ are denoted by $k_i$ and $k_\alpha$, respectively. Here, we use
Greek and Latin letters, respectively, for object-related
and user-related indices to distinguish them. Considering
a discrete rating system, the bipartite network can be
represented by a rating matrix $A$ [37], where the element
$a_{i\alpha} \in \Omega = \{\omega_1, \omega_2, \ldots, \omega_z\}$ is the weight of the link
connecting node $U_i$ and node $O_\alpha$, i.e., the rating given by
user $i$ to object $\alpha$. In a reputation system, each user $i$ is
assigned a reputation, denoted by $R_i$. The users with
very low reputations are detected as spammers.

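The GR method works as follows. Firstly, we group
users according to their ratings. Specifically, for an object
$\alpha$, we put users who gave the rating $\omega_s$ into group $\Gamma_{s\alpha}$:

$$\Gamma_{s\alpha} = \{U_i \mid a_{i\alpha} = \omega_s,\ i = 1, 2, \ldots, m\}. \tag{1}$$

Obviously, user $i$ belongs to $k_i$ different groups.
Secondly, we calculate the sizes of all groups, $\Lambda_{s\alpha} = |\Gamma_{s\alpha}|$, i.e.
the number of users who gave the rating $\omega_s$ to object $\alpha$.
Thirdly, we establish a matrix $\Lambda^*$, named the rating-rewarding
matrix, by normalizing $\Lambda$ per column:

$$\Lambda^*_{s\alpha} = \frac{\Lambda_{s\alpha}}{k_\alpha}, \tag{2}$$

where $k_\alpha = \sum_s \Lambda_{s\alpha}$ is the number of users who rated object $\alpha$.
Fourthly, referring to $\Lambda^*$, we map the original rating
matrix $A$ to a matrix $A'$, named the rewarding matrix. More
specifically, the rewarding that a user $i$ obtains from his
rating $a_{i\alpha}$ is defined as $A'_{i\alpha} = \Lambda^*_{s\alpha}$, where $a_{i\alpha} = \omega_s$. $A'_{i\alpha}$
is null if user $i$ has not yet rated object $\alpha$.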
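
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors’ implementation: it assumes the rating matrix is stored as a NumPy array with `np.nan` marking non-ratings, and the function name `rewarding_matrix` and the toy matrix are hypothetical. Only the second column of the toy matrix is chosen to echo the example in fig. 1 (two users rating 4, group sizes 1, 2 and 2).

```python
import numpy as np

def rewarding_matrix(A):
    """Map a rating matrix A (users x objects, np.nan = not rated)
    to the rewarding matrix A' built from normalized group sizes."""
    A = np.asarray(A, dtype=float)
    rated = ~np.isnan(A)
    k_obj = rated.sum(axis=0)                  # degree k_alpha of each object
    A_prime = np.full(A.shape, np.nan)         # null where there is no rating
    for w in np.unique(A[rated]):              # every rating value that occurs
        in_group = (A == w)                    # membership of the groups Gamma_{s,alpha}
        size = in_group.sum(axis=0)            # group sizes Lambda_{s,alpha}
        with np.errstate(divide="ignore", invalid="ignore"):
            share = size / k_obj               # Lambda*_{s,alpha}, column-normalized
        A_prime[in_group] = np.broadcast_to(share, A.shape)[in_group]
    return A_prime

# Hypothetical toy data: 5 users (rows) and 3 objects (columns).
A = np.array([
    [5,      5, np.nan],
    [4,      4, 3],
    [np.nan, 5, 3],
    [4,      4, 3],
    [1,      1, 5],
])
A_prime = rewarding_matrix(A)
print(A_prime[1, 1])   # rewarding of the second user on the second object: 2 / (1 + 2 + 2)
```

Looping over rating values rather than over users keeps the sketch short; for large sparse data sets a sparse representation would be a more natural choice.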

Then, we assign reputations to users according to their 
rewarding. On the one hand, if the average of a user’s 
Table 1: Basic statistics of the three real data sets. m is the
number of users, n is the number of objects, ⟨k_U⟩ is the average
degree of users, ⟨k_O⟩ is the average degree of objects, and S =
l/mn is the sparsity of the bipartite network, where l is the number of ratings.

Data set     m      n      ⟨k_U⟩   ⟨k_O⟩   S
MovieLens    943    1682   106     60      0.063
Netflix      1038   1215   47      40      0.039
Amazon       662    1500   36      15      0.023


rewarding is small, most of his ratings must deviate
from the majority, suggesting that he is highly suspect.
On the other hand, if the rewarding varies widely, he is
also untrustworthy because of his unstable rating behavior. Based
on these considerations, we define user $i$’s reputation as

$$R_i = \frac{\mu(A'_i)}{\sigma(A'_i)}, \tag{3}$$

where $\mu$ and $\sigma$ denote the mean value and the standard
deviation, respectively. Specifically, the mean value of $A'_i$
is defined as

$$\mu(A'_i) = \frac{\sum_\alpha A'_{i\alpha}}{k_i}, \tag{4}$$

and the standard deviation of $A'_i$ is defined as

$$\sigma(A'_i) = \sqrt{\frac{\sum_\alpha \left(A'_{i\alpha} - \mu(A'_i)\right)^2}{k_i}}, \tag{5}$$

where the sums run over the objects rated by user $i$.
In fact, $R_i$ is the inverse of the coefficient of
variation [38] of the vector $A'_i$, which measures the dispersion of
the frequency distribution of user $i$’s rewardings. Finally,
we sort users in ascending order of reputation, and deem
the top-L ones detected spammers. A visual
representation of the GR method is given in fig. 1.
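
As a rough illustration of the reputation and ranking step, the sketch below computes the ratio of mean to standard deviation of each user’s rewardings and returns the L lowest-reputation users. It is a sketch under stated assumptions rather than the authors’ code: the rewarding matrix is taken as a NumPy array with `np.nan` for null entries, every user is assumed to have at least one rating, the population standard deviation (division by the user’s degree) is assumed, and a user whose rewardings do not vary at all is given an infinite reputation, a case the text does not discuss. The names `reputations` and `spam_list` and the toy values are hypothetical.

```python
import numpy as np

def reputations(A_prime):
    """Reputation of each user: mean / standard deviation of the
    non-null rewardings in that user's row."""
    mu = np.nanmean(A_prime, axis=1)
    sigma = np.nanstd(A_prime, axis=1)         # population std (assumption)
    with np.errstate(divide="ignore", invalid="ignore"):
        return mu / sigma                      # inf when sigma == 0 (assumption)

def spam_list(R, L):
    """Indices of the L users with the lowest reputation."""
    return np.argsort(R, kind="stable")[:L]

# Hypothetical rewarding matrix for 3 users and 4 objects.
A_prime = np.array([
    [0.5, 0.5, 0.4,    0.6],      # consistently in fairly large groups
    [0.1, 0.9, np.nan, 0.2],      # unstable rewardings
    [0.2, 0.1, 0.3,    np.nan],   # consistently in small groups
])
R = reputations(A_prime)
print(R)
print(spam_list(R, L=2))
```

In this toy case the unstable user ends up with the lowest reputation, followed by the user who always sits in small groups, which matches the two considerations behind the reputation definition.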

Data and Metrics. — We consider three commonly
studied real data sets, MovieLens, Netflix and Amazon, to
test the accuracy of the GR method. MovieLens and Netflix
contain ratings on movies, provided by the GroupLens project
at the University of Minnesota (www.grouplens.org) and
released by the DVD rental company Netflix for a contest
on recommender systems (www.netflixprize.com),
respectively. Amazon contains ratings on products (e.g. books,
music, etc.) crawled from amazon.com [39]. All three
data sets use a 5-point rating scale, with 1 being the worst
and 5 being the best. Herein, we sampled and extracted
three smaller data sets from the original data sets,
respectively, by choosing users who have at least 20 ratings and
the objects rated by these users, since it is hard
to tell whether small-degree users are spammers [28]. The
basic statistics of the data sets are summarized in table 1.

Generating artificial spammers. Two types of distorted
ratings are widely found in real rating systems, namely,
malicious ratings and random ratings. The malicious
ratings are from spammers who always give the minimum
(maximum) allowable ratings to push down (up) certain target
Fig. 2: (Color online) The relation between reputation $R$ and
rating error $\delta$ (bins) in different methods. (a) and (b) are for
the CR and GR method, respectively. $\delta$ and $R$ are respectively
normalized for comparison under different data sets.


objects [9,10]. The random ratings mainly come from
mischievous users or test engineers who randomly give
meaningless ratings [14,15]. As spammers are unknown
in real data, to test the method, we manipulate the three
real data sets by adding one type of artificial spammers
(i.e. malicious or random) at a time.

In the implementation, we randomly select $d$ users and
assign them distorted ratings: (i) integer 1 or 5 with the
same probability (i.e., 0.5) for malicious spammers, and
(ii) random integers in {1, 2, 3, 4, 5} for random spammers.
Thus, the ratio of spammers is $q = d/m$. To study the
effects of spammers’ activity, we define $p = k/n$ as the activity
of spammers, where $k$ is the degree of each spammer. Here
$k$ is a tunable parameter: if $k$ is no more than a
spammer’s original degree, we randomly select $k$ of his/her
ratings and replace them with distorted ratings, and the
unselected ratings are ignored. Otherwise, after
replacing all the spammer’s original ratings, we randomly select
the remaining number of non-rated objects and assign them
distorted ratings.
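
The injection procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors’ code: the function name `inject_spammers`, its parameters and the `seed` argument are hypothetical, non-ratings are assumed to be stored as `np.nan`, random spammers are assumed to draw uniformly from {1, ..., 5}, and k is assumed not to exceed the number of objects.

```python
import numpy as np

def inject_spammers(A, d, k, kind="malicious", seed=0):
    """Return a copy of A in which d randomly chosen users become
    spammers of degree k, together with the indices of those users."""
    rng = np.random.default_rng(seed)
    A = np.array(A, dtype=float)               # work on a copy
    m, n = A.shape
    spammers = rng.choice(m, size=d, replace=False)
    for i in spammers:
        rated = np.flatnonzero(~np.isnan(A[i]))
        unrated = np.flatnonzero(np.isnan(A[i]))
        if k <= rated.size:
            # Replace k of the existing ratings; the rest are dropped below.
            targets = rng.choice(rated, size=k, replace=False)
        else:
            # Replace all existing ratings and rate extra objects.
            extra = rng.choice(unrated, size=k - rated.size, replace=False)
            targets = np.concatenate([rated, extra])
        if kind == "malicious":
            fake = rng.choice([1, 5], size=k)  # 1 or 5 with probability 0.5 each
        else:
            fake = rng.integers(1, 6, size=k)  # uniform on {1, ..., 5} (assumption)
        A[i] = np.nan                          # unselected ratings are ignored
        A[i, targets] = fake
    return A, spammers
```

With this convention the ratio of spammers is q = d/m and their activity is p = k/n, so a grid over (p, q) can be explored by varying `k` and `d`.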

Metrics for evaluation. To evaluate the performance
of ranking methods, we adopt two commonly used metrics:
recall [40] and AUC (the area under the ROC curve) [41].
The recall value measures to what extent the spammers
can be detected in the top-L ranking list,

$$R_c(L) = \frac{d'(L)}{d}, \tag{6}$$

where $d'(L) \le d$ is the number of detected spammers in
the top-L list. A higher $R_c$ indicates a higher accuracy.
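
For concreteness, the recall of eq. (6) can be computed as in the short sketch below. The function name `recall_at_L` and the example lists are hypothetical; `ranked_users` is assumed to be the list of user indices sorted by ascending reputation, so that suspected spammers come first.

```python
def recall_at_L(ranked_users, spammers, L):
    """Fraction of the true spammers that appear among the first L
    users of a list sorted by ascending reputation."""
    spammers = set(spammers)
    detected = sum(1 for u in ranked_users[:L] if u in spammers)
    return detected / len(spammers)

# Hypothetical ranking of five users, two of whom (1 and 2) are spammers.
print(recall_at_L([3, 1, 4, 0, 2], spammers={1, 2}, L=2))   # one of two spammers found
```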

Note that $R_c$ only focuses on the top-L ranks, and thus
we also consider an L-independent metric called AUC.
Provided the rank of all users, the AUC value can be
interpreted as the probability that a randomly chosen spammer
is ranked higher than a randomly chosen non-spammer.
Fig. 3: (Color online) The recall $R_c$ of different algorithms as
a function of the length of the list L. (a), (b) and (c) are for
malicious spammers, (d), (e) and (f) are for random spammers,
with $d = 50$ being fixed. The parameter $p$ is set as about 0.05,
0.03 and 0.02 for MovieLens, Netflix and Amazon, respectively.
The results are averaged over 100 independent realizations.


To calculate AUC, at each time we randomly pick a
spammer and a non-spammer and compare their reputations. If,
among $N$ independent comparisons, there are $N'$ times when
the spammer has a lower reputation and $N''$ times when they
have the same reputation, the AUC value is

$$\mathrm{AUC} = \frac{N' + 0.5 N''}{N}. \tag{7}$$

If all users are ranked randomly, the AUC value should
be about 0.5. Therefore, the degree to which the value
exceeds 0.5 indicates how much better the method performs than
pure chance [42].
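
The sampling estimate of eq. (7) might be implemented along the following lines. This is a minimal sketch: the function name `auc_by_sampling`, the default number of comparisons and the `seed` argument are assumptions made here, and `reputation` is assumed to be a sequence indexed by user.

```python
import random

def auc_by_sampling(reputation, spammers, n_comparisons=10000, seed=0):
    """Estimate AUC from random spammer / non-spammer comparisons:
    (N' + 0.5 * N'') / N, where N' counts comparisons in which the
    spammer has the lower reputation and N'' counts ties."""
    rng = random.Random(seed)
    spam_set = set(spammers)
    spam = sorted(spam_set)
    normal = [u for u in range(len(reputation)) if u not in spam_set]
    lower = ties = 0
    for _ in range(n_comparisons):
        s = rng.choice(spam)
        u = rng.choice(normal)
        if reputation[s] < reputation[u]:
            lower += 1
        elif reputation[s] == reputation[u]:
            ties += 1
    return (lower + 0.5 * ties) / n_comparisons
```

When the number of users is small, the same quantity can be computed exactly by enumerating all spammer and non-spammer pairs instead of sampling.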


Results. — According to the single standard
assumption, each object $\alpha$ has a true quality, denoted by $Q_\alpha$. As
the true quality is unknown in reality, taking the average
rating as an estimation of $Q_\alpha$ is the most straightforward
way. Then, the rating error of user $i$ is defined as

$$\delta_i = \frac{\sum_\alpha |a_{i\alpha} - Q_\alpha|}{k_i}, \tag{8}$$

where $Q_\alpha = \frac{1}{k_\alpha} \sum_i a_{i\alpha}$ is the average rating of object
$\alpha$ and $\alpha$ runs over all objects rated by user $i$. A
reasonable ranking method should give high reputations
to the users with low rating errors, i.e. $R_i$ should be
negatively correlated with $\delta_i$.
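
The rating error of eq. (8) can be evaluated directly from the rating matrix, as in the sketch below. The function name `rating_errors` and the toy matrix are hypothetical, and non-ratings are again assumed to be stored as `np.nan`.

```python
import numpy as np

def rating_errors(A):
    """Mean absolute deviation of each user's ratings from the
    average rating of the objects that user has rated."""
    A = np.asarray(A, dtype=float)
    Q = np.nanmean(A, axis=0)                  # average rating of each object
    return np.nanmean(np.abs(A - Q), axis=1)   # average over the rated objects only

# Hypothetical toy data: 3 users and 2 objects.
A = np.array([
    [5, 1],
    [3, np.nan],
    [1, 1],
])
print(rating_errors(A))
```

The Pearson correlation between these errors and a vector of reputations can then be obtained with `np.corrcoef`, whose off-diagonal entry is the correlation coefficient.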

Figure 2 shows the relation between users’ rating errors
and reputations evaluated by CR and GR, respectively.
As users’ rating errors are continuous and on different
scales, we normalize them and divide them into bins of
length 0.05. As one can see, both methods
assign high reputations to users with small $\delta_i$, while the GR
method outperforms the CR method by stably assigning low
reputations (see fig. 2(b)) to users with large $\delta_i$.
Furthermore, to quantify the correlation, we calculate the Pearson


Table 2: AUC of different algorithms for the real data sets with
d = 50 being fixed. The parameter p is set as about 0.05, 0.03
and 0.02 for MovieLens, Netflix and Amazon, respectively. The
results are averaged over 100 independent realizations.

Data set     Malicious spammers     Random spammers
             CR       GR            CR       GR
MovieLens    0.876    0.994         0.914    0.959
Netflix      0.543    0.977         0.668    0.930
Amazon       0.824    0.941         0.877    0.949


correlation coefficient $\rho$ [38] between $R_i$ and $\delta_i$.
Specifically, $\rho$ = -0.956 (-0.949), -0.906 (-0.872) and -0.966
(-0.816) after applying the GR (CR) method to MovieLens,
Netflix and Amazon, respectively. The stronger negative
correlation suggests that the GR method is better at evaluating
users’ reputations.

Effectiveness and efficiency. To test the effectiveness
of the ranking algorithms, based on the three real data sets,
we first generate artificial data sets with 50 spammers (i.e.
$d = 50$). Each data set contains only one type of spammers:
malicious or random. On the generated data sets, we
calculate the recall of different algorithms as a function of the
spammer list’s length L. As shown in fig. 3, the GR method
has a remarkable advantage over the CR method in
detecting both types of spammers, especially when L is larger
than $d$. We also note that $R_c$ for malicious
spammers is a little higher than that for random spammers when
L is smaller than $d$, which implies that detecting random
spammers is relatively harder.

The AUC values are shown in table 2, where one
can see that the AUC values of the GR method are higher than those
of the CR method for every data set, suggesting that the GR
method has a significant advantage over the CR method.
The table also shows that the CR method is better at detecting
random spammers than malicious spammers, while the reverse
generally holds for the GR method (Amazon being a marginal
exception). In addition, it is worth noting
that the AUC is generally lower in Netflix, especially for
the CR method. One possible explanation is that there are
more harmful spammers in Netflix and the CR method is
very sensitive to “real” spammers [27,28]. Additionally, in
Netflix, there are more small-degree objects whose quality
will be considered higher by the CR method [28], which
may also lead to biased ranking.

Robustness against spammers. We then study the
robustness of different methods by varying $p$ (the ratio of
objects rated by spammers) and $q$ (the ratio of spammers).
In the following, we set the length of the detected spam list
equal to the number of artificial spammers, namely,
$L = d$. Figure 4 shows the recall obtained by the GR method.
The ranges of $p$ and $q$ are set individually for the different
data sets according to their sparsity. One can observe that,
overall, the GR method performs better in detecting
malicious spammers, especially when $p$ is small (i.e.
spammers are of small degree). Moreover, when $q$ is small, the
Fig. 4: (Color online) The effectiveness of the GR method. The
color marks the recall $R_c$. $q$ and $p$ are the ratio of spammers and the ratio
of objects rated by spammers, respectively. (a), (b) and (c)
are for malicious spammers, (d), (e) and (f) are for random
spammers. The parameter is set as $L = d$. The results are
averaged over 100 independent realizations.

recalls for random spammers are low for the MovieLens
and Netflix data sets (see figs. 4(d) and 4(e)), while
for the Amazon data set, the recall for both types
of spammers is low (see figs. 4(c) and 4(f)). In addition,
the recall increases with $q$. These observations
suggest that (i) detecting malicious spammers is easier
than detecting random spammers, (ii) MovieLens and Netflix may
contain more “real” spammers, and (iii) the GR method is
powerful in detecting spammers who only rate a small
number of objects.

To comprehensively compare the performance of the GR
and CR methods, we calculate the difference of recall
between the two, formulated as $\Delta R_c = R_c^{GR} - R_c^{CR}$. As
shown in fig. 5, $\Delta R_c$ is above 0 in most of the area, suggesting
that the overall performance of the GR method is better. In
detail, the GR method has a remarkable advantage in detecting
malicious spammers, while the advantage is not obvious
for random spammers. In addition, $\Delta R_c$ is large when $p$
and $q$ are small, implying that the GR method is more robust
against a small number of small-degree spammers, which
are usually difficult to detect.

Conclusions and Discussion. — In summary, we
have proposed a group-based ranking (GR) method to solve the
ranking problem in online rating systems with spamming
attacks. By grouping users according to their ratings, we
construct a rating-rewarding matrix according to the
corresponding group sizes, based on which we map a user’s
rating vector to a rewarding vector. Then, this user’s
reputation is assigned according to the inverse of the dispersion of
the frequency distribution of his rewarding vector. Extensive
experiments showed that the proposed method is effective
in evaluating users’ reputations, especially for those with
high rating errors. In tests on generated data with
two types of artificial spammers, the GR method performs
better in both accuracy and robustness than
the correlation-based ranking (CR) method, especially
in resisting small-degree spammers. Interestingly,
the accuracy of spam detection on the Netflix data set is low
for both methods, indicating that there are more
original distorted ratings in the Netflix data set, which is in
accordance with some previous studies [27,28].

Fig. 5: (Color online) The comparison of GR and CR methods.
Panel columns are MovieLens, Netflix and Amazon; the horizontal axis is $p$.
The color marks $\Delta R_c$ if $\Delta R_c > 0$, otherwise the color is green,
meaning that the CR method is better. (a), (b) and (c) are for
malicious spammers, (d), (e) and (f) are for random spammers.
The parameter is set as $L = d$. The results are averaged over
100 independent realizations.

The proposed method has several distinguishing
characteristics that differentiate it from current reputation
allocation procedures: (i) The method assigns users’
reputations by considering the grouping behavior of users
instead of estimating objects’ true qualities.
(ii) The method performs well in both accuracy
and robustness, especially when dealing with attacks by
small-degree spammers. (iii) The method is very efficient, as its
time complexity [43] is $O(m^2)$, which is significantly lower
than that of most previously proposed iterative methods. As
further improvements, we could consider introducing this
method into an iterative process [25-28], applying it to
continuous rating systems [44,45], and considering the effect
of the long-term evolution of online rating systems [16].